Sequencing and Raw Sequence Data Quality Control ◾ 37
Figure 1.29 shows the deterioration in base quality toward the end of the reads
that is common with short-read sequencing instruments. We can also see that the
quality scores at some positions are as low as Phred 2 (an error probability of about 0.63).
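The relationship between a Phred score and its error probability can be checked with a few lines of Python. This is a minimal sketch of the standard definition Q = -10 log10(P); the function names are illustrative and are not part of FastQC or any other tool.

```python
import math


def phred_to_error_prob(q):
    """Return the base-call error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Return the Phred quality score for a base-call error probability."""
    return -10 * math.log10(p)


# Print the error probability for a few representative quality scores.
for q in (2, 10, 20, 28, 30):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4f}")
```

A score of 2 gives an error probability of roughly 0.63, whereas a score of 28 gives roughly 0.0016, which is why low-scoring positions at the read ends cause the metric to fail.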
The report shows three failures and a single warning: failed per base sequence quality
(Figure 1.29), failed per base sequence content and failed k-mer content (Figure 1.30), and
overrepresented sequences warning (Figure 1.31).
The QC processing strategy differs from one FASTQ file to another, depending on
the failed metrics. Understanding the problem usually indicates which kinds
of QC processing to perform. In our example file, we will begin by filtering the low-quality
reads and clipping the overrepresented sequences, and then we will run FastQC again to see
how the quality has improved.
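To make the idea of filtering low-quality reads concrete, the sketch below keeps a read only when a minimum percentage of its bases reaches a quality threshold. It is an illustration of the rule, not the implementation of any particular tool: the function names are hypothetical, the quality string is assumed to use Phred+33 encoding, and the default thresholds (28 and 80%) simply match the values used in the filtering step that follows. Boundary cases may be handled differently by real filtering programs.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def passes_filter(quality_line, min_quality=28, min_percent=80, offset=33):
    """Return True when at least min_percent of bases have score >= min_quality."""
    scores = phred_scores(quality_line, offset)
    if not scores:
        return False  # an empty read cannot satisfy the rule
    good = sum(1 for q in scores if q >= min_quality)
    # Integer comparison avoids floating-point rounding at the boundary.
    return good * 100 >= min_percent * len(scores)


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
print(passes_filter("IIIIIIII##"))  # 8 of 10 bases are high quality
print(passes_filter("IIIIIII###"))  # 7 of 10 bases are high quality
```

In this toy input, the first read has 80% of its bases at or above the threshold and is kept, while the second has only 70% and is discarded.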
First, we will try to fix the per base sequence quality of the reads in the FASTQ file by
using “fastq_quality_filter” to keep only the reads in which at least 80% of the bases have
quality scores equal to or greater than 28. The following script performs the filtering (the output file
is “filtered.fastq”), runs FastQC to generate the new QC report, and then opens the
report in the Firefox web browser:
fastq_quality_filter \
-i bad.fastq \
-q 28 \
FIGURE 1.30 Failed per base sequence content and k-mer content.
FIGURE 1.31 Overrepresented sequences that raised a warning.